0.906
.
images/ in your root directory.
📘 Parameters for fastdup.create
- work_dir - Path to store the artifacts generated from the run.
- input_dir - Path to the images.
fastdup will start analyzing the dataset for potential issues. How long it takes to complete the run depends on your computing power.
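The call described above can be sketched roughly as follows. This is a minimal sketch, assuming the fastdup v1 Python API, where fastdup.create is followed by a run() call on the returned object. The analyze helper name and the fastdup_work_dir path are placeholders that do not come from this tutorial.

```python
def analyze(input_dir="images/", work_dir="fastdup_work_dir"):
    """Create a fastdup object for input_dir and run the analysis."""
    # Imported lazily so this snippet loads even where fastdup is not installed.
    import fastdup

    # work_dir: where the run's artifacts are stored.
    # input_dir: where the images live.
    fd = fastdup.create(work_dir=work_dir, input_dir=input_dir)
    fd.run()  # assumed entry point that starts the analysis
    return fd
```

Calling fd = analyze() would then give an object to use for the visualizations later in the tutorial.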
🚧 Run time On Google Colab (free version with 2 CPU cores) it takes a little over 3 minutes to complete the run!
Once the run finishes, we can visualize all the issues found.
Pandas DataFrame:
img_filename | fastdup_id | error_code | is_valid |
---|---|---|---|
Abyssinian_34.jpg | 135 | ERROR_ZERO_SIZE_FILE | False |
Egyptian_Mau_139.jpg | 2240 | ERROR_ZERO_SIZE_FILE | False |
Egyptian_Mau_145.jpg | 2247 | ERROR_ZERO_SIZE_FILE | False |
Egyptian_Mau_167.jpg | 2268 | ERROR_ZERO_SIZE_FILE | False |
Egyptian_Mau_177.jpg | 2278 | ERROR_ZERO_SIZE_FILE | False |
Egyptian_Mau_191.jpg | 2293 | ERROR_ZERO_SIZE_FILE | False |
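The ERROR_ZERO_SIZE_FILE code suggests these files are empty on disk. As an independent sanity check, one could list zero-byte files with the standard library alone. This is not how fastdup detects them, only a sketch; find_zero_size_files is a hypothetical helper name.

```python
import os


def find_zero_size_files(directory):
    """Return the sorted names of regular files in `directory` that are 0 bytes long."""
    empty = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Skip sub-directories; only regular files can be "broken images" here.
        if os.path.isfile(path) and os.path.getsize(path) == 0:
            empty.append(name)
    return sorted(empty)
```

Running find_zero_size_files("images/") and comparing the result with the img_filename column is one way to cross-check the report.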
📘 Something unexpected Broken images are something we did not expect to see, especially in a curated dataset like the Oxford-IIIT Pet Dataset. But this shows how easily fastdup can detect them with just one line of code.
📘 Under the hood The above gallery shows duplicate pairs computed using the cosine distance. A distance of 1.0 indicates the image pair is an exact duplicate. TIP: You can specify num_images as a parameter to fd.vis.duplicates_gallery to see more or fewer image pairs. For example: fd.vis.duplicates_gallery(num_images=5)
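A score of 1.0 for exact duplicates is what the cosine measure gives for two identical feature vectors. The sketch below computes it in plain Python. The vectors are made up for illustration, and how fastdup extracts features from images is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical feature vectors: identical vectors score (approximately) 1.0,
# orthogonal vectors score 0.0.
same = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Scores close to, but below, 1.0 correspond to near-duplicates rather than exact copies.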
📘 Verify outliers As you can see, not all images in the outliers report are true outliers. The images appear on the report simply because they look different from other images in the dataset (distance-wise). As a curator, you’d need to verify if they are true outliers by inspecting the report.
With metric='dark' we can visualize the darkest images from the dataset.
metric='bright' populates the gallery with the brightest images on top.
metric='blur' shows the blurriest images on top.
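The three metric values above could be rendered in one pass, as sketched below. This assumes fd is the object returned by fastdup.create after the run has finished and that the gallery is produced by fd.vis.stats_gallery, as in the fastdup v1 API. The show_stats_galleries helper is hypothetical.

```python
def show_stats_galleries(fd, metrics=("dark", "bright", "blur")):
    """Build one statistics gallery per metric and return the results by metric name."""
    results = {}
    for metric in metrics:
        # Each call ranks the images by the given statistic, most extreme first.
        results[metric] = fd.vis.stats_gallery(metric=metric)
    return results
```

Passing a shorter tuple, for example metrics=("blur",), limits the output to a single gallery.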
📘 Verify dark, bright and blurry images Again, we see that not all images in the statistical visualization gallery are problematic. As a curator, you’d need to verify and filter out the problematic images. Since the Oxford-IIIT Pet Dataset is curated, we would not expect to find extremely bright, dark or blurry images.
📘 Term Clusters are known as ‘components’ in fastdup. You’d see the term ‘component’ used more frequently in code and documentation.
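The word ‘component’ hints at connected components of a graph in which images are nodes and sufficiently similar pairs are edges. The sketch below illustrates that idea with a small union-find. It assumes this graph view and is not fastdup's actual clustering code; the image indices and pairs are invented.

```python
def connected_components(num_images, similar_pairs):
    """Group image indices 0..num_images-1 into components linked by similar pairs."""
    parent = list(range(num_images))

    def find(i):
        # Walk up to the root, shortening the path along the way (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for i in range(num_images):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical example: images 0-1-2 are chained together, 3-4 form a pair, 5 is alone.
components = connected_components(6, [(0, 1), (1, 2), (3, 4)])
```

Note that images 0 and 2 end up in the same component even though they were never compared directly, which is why a component can contain images that are only transitively similar.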
👍 TLDR In this tutorial we’ve seen how to use fastdup to find:
- Broken images.
- Duplicate image pairs.
- Outliers.
- Dark, bright and blurry images.
- Image clusters.
What to do about all the problematic images? You can decide to keep or eliminate them. Check out the next tutorial!